Abstract

We propose a general method called truncated gradient to induce sparsity in the weights 
of online learning algorithms with convex loss functions. This method has several essential 
properties: 

1. The degree of sparsity is continuous — a parameter controls the rate of sparsification from 
no sparsification to total sparsification. 

2. The approach is theoretically motivated, and an instance of it can be regarded as an online
counterpart of the popular $L_1$-regularization method in the batch setting. We prove that
small rates of sparsification result in only small additional regret with respect to typical
online learning guarantees.

3. The approach works well empirically. 

We apply the approach to several datasets and find that for datasets with large numbers of 
features, substantial sparsity is discoverable. 

1 Introduction 

We are concerned with machine learning over large datasets. As an example, the largest dataset
we use here has over $10^7$ sparse examples and $10^9$ features using about $10^{11}$ bytes. In this
setting, many common approaches fail, simply because they cannot load the dataset into memory or
they are not sufficiently efficient. There are roughly two approaches which can work:

1. Parallelize a batch learning algorithm over many machines (e.g., [3]). 

2. Stream the examples to an online learning algorithm (e.g., [9], [10], [2], and [6]). 

This paper focuses on the second approach. 

Typical online learning algorithms have at least one weight for every feature, which is too much
in some applications for a couple of reasons:

1. Space constraints. If the state of the online learning algorithm overflows RAM, it cannot
run efficiently. A similar problem occurs if the state overflows the L2 cache.



2. Test time constraints on computation. Substantially reducing the number of features can yield 
substantial improvements in the computational time required to evaluate a new sample. 

This paper addresses the problem of inducing sparsity in learned weights while using an online 
learning algorithm. There are several ways to do this wrong for our problem. For example: 

1. Simply adding $L_1$ regularization to the gradient of an online weight update doesn't work
because gradients don't induce sparsity. The essential difficulty is that a gradient update has
the form $a + b$ where $a$ and $b$ are two floats. Very few float pairs add to 0 (or any other default
value), so there is little reason to expect a gradient update to accidentally produce sparsity.

2. Simply rounding weights to 0 is problematic because a weight may be small due to being
useless or small because it has been updated only once (either at the beginning of training or
because the set of features appearing is also sparse). Rounding techniques can also play havoc
with standard online learning guarantees.

3. Black-box wrapper approaches which eliminate features and test the impact of the elimination 
are not efficient enough. These approaches typically run an algorithm many times which is 
particularly undesirable with large datasets. 

1.1 What Others Do 

The Lasso algorithm [13] is commonly used to achieve $L_1$ regularization for linear regression. This
algorithm does not work automatically in an online fashion. There are two formulations of $L_1$
regularization. Consider a loss function $L(w, z_i)$ which is convex in $w$, where $z_i = (x_i, y_i)$ is an
input/output pair. One is the convex constraint formulation

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{n} L(w, z_i) \quad \text{subject to } \|w\|_1 \le s, \tag{1}$$

where $s$ is a tunable parameter. The other is soft-regularization, where

$$\hat{w} = \arg\min_{w} \sum_{i=1}^{n} L(w, z_i) + \lambda \|w\|_1. \tag{2}$$

With appropriately chosen $\lambda$, the two formulations are equivalent. The convex constraint
formulation has a simple online version using the projection idea in [15]. It requires the projection of
weight $w$ into an $L_1$ ball at every online step. This operation is difficult to implement efficiently for
large-scale data where we have examples with sparse features and a large number of features. In
such situations, we require the number of operations per online step to be linear with respect
to the number of nonzero features, and independent of the total number of features. Our method,
which works with the soft-regularization formulation (2), satisfies the requirement. Additional
details can be found in Section 5. In addition to the $L_1$ regularization formulation (2), the family of
online algorithms we consider in this paper also includes some nonconvex sparsification techniques.

The Forgetron algorithm [4] is an online learning algorithm that manages memory use. It 
operates by decaying the weights on previous examples and then rounding these weights to zero 
when they become small. The Forgetron is stated for kernelized online algorithms, while we are 
concerned with the simple linear setting. When applied to a linear kernel, the Forgetron is not 
computationally or space competitive with approaches operating directly on feature weights. 

1.2 What We Do

At a high level, the approach we take is weight decay to a default value. This simple approach 
enjoys a strong performance guarantee, as discussed in section 3. For instance, the algorithm 
never performs much worse than a standard online learning algorithm, and the additional loss due 
to sparsification is controlled continuously with a single real-valued parameter. The theory gives 
a family of algorithms with convex loss functions for inducing sparsity — one per online learning 
algorithm. We instantiate this for square loss and show how to deal with sparse examples efficiently 
in section 4. 

As mentioned in the introduction, we are mainly interested in sparse online methods for large 
scale problems with sparse features. For such problems, our algorithm should satisfy the following 
requirements: 

• The algorithm should be computationally efficient: the number of operations per online step 
should be linear in the number of nonzero features, and independent of the total number of 
features. 

• The algorithm should be memory efficient: it needs to maintain a list of active features, and can 
insert (when the corresponding weight becomes nonzero) and delete (when the corresponding 
weight becomes zero) features dynamically. 

The implementation details, showing that our methods satisfy the above requirements, are provided 
in section 5. 

Theoretical results stating how much sparsity is achieved using this method generally require
additional assumptions which may or may not be met in practice. Consequently, we rely on
experiments in section 6 to show that our method achieves good sparsity in practice. We compare our
approach to a few others, including $L_1$ regularization on small data, as well as online rounding of
coefficients to zero.

2 Online Learning with GD 

In the setting of standard online learning, we are interested in sequential prediction problems where
repeatedly for $i = 1, 2, \ldots$:

1. An unlabeled example $x_i$ arrives.

2. We make a prediction based on existing weights $w_i \in \mathbb{R}^d$.

3. We observe $y_i$, let $z_i = (x_i, y_i)$, and incur some known loss $L(w_i, z_i)$ convex in parameter $w_i$.

4. We update weights according to some rule: $w_{i+1} \leftarrow f(w_i)$.

We want to come up with an update rule $f$, which allows us to bound the sum of losses

$$\sum_{i=1}^{t} L(w_i, z_i)$$

as well as achieving sparsity. For this purpose, we start with the standard stochastic gradient
descent rule, which is of the form:

$$f(w_i) = w_i - \eta \nabla_1 L(w_i, z_i), \tag{3}$$

where $\nabla_1 L(a, b)$ is a sub-gradient of $L(a, b)$ with respect to the first variable $a$. The parameter
$\eta > 0$ is often referred to as the learning rate. In our analysis, we only consider a constant learning
rate with fixed $\eta > 0$ for simplicity. In theory, it might be desirable to have a decaying learning
rate $\eta_i$ which becomes smaller when $i$ increases to get the so-called no-regret bound without knowing
$T$ in advance. However, if $T$ is known in advance, one can select a constant $\eta$ accordingly so that
the regret vanishes as $T \to \infty$. Since our focus is on sparsity, not how to choose the learning rate, for
clarity, we use a constant learning rate in the analysis because it leads to simpler bounds.

The above method has been widely used in online learning such as [10] and [2]. Moreover, it is 
argued to be efficient even for solving batch problems where we repeatedly run the online algorithm 
over training data multiple times. For example, the idea has been successfully applied to solve 
large-scale standard SVM formulations [11, 14]. In the scenario outlined in the introduction, online 
learning methods are more suitable than some traditional batch learning methods. 

However, a main drawback of (3) is that it does not achieve sparsity, which we address in this 
paper. Note that in the literature, this particular update rule is often referred to as gradient descent 
(GD) or stochastic gradient descent (SGD). There are other variants, such as exponentiated gradient 
descent (EG). Since our focus in this paper is sparsity, not GD versus EG, we shall only consider 
modifications of (3) for simplicity. 
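
The four-step protocol and rule (3) can be sketched as follows. This is only an illustration for square loss $L(w, z) = (w \cdot x - y)^2$; the function names `predict`, `sgd_step` and `run_online` are hypothetical and not taken from any implementation discussed in this paper.

```python
from typing import Iterable, List, Tuple

Example = Tuple[List[float], float]  # (x_i, y_i)


def predict(w: List[float], x: List[float]) -> float:
    """Linear prediction w . x."""
    return sum(wj * xj for wj, xj in zip(w, x))


def sgd_step(w: List[float], x: List[float], y: float, eta: float) -> List[float]:
    """Rule (3) for square loss, whose gradient in w is 2 (w . x - y) x."""
    err = predict(w, x) - y
    return [wj - eta * 2.0 * err * xj for wj, xj in zip(w, x)]


def run_online(examples: Iterable[Example], d: int, eta: float) -> Tuple[List[float], float]:
    """Run the online protocol; return the final weights and the summed loss."""
    w = [0.0] * d
    total_loss = 0.0
    for x, y in examples:                # step 1: x_i arrives
        y_hat = predict(w, x)            # step 2: predict with w_i
        total_loss += (y_hat - y) ** 2   # step 3: observe y_i, incur the loss
        w = sgd_step(w, x, y, eta)       # step 4: w_{i+1} <- f(w_i)
    return w, total_loss
```

Note that a weight whose feature never appears stays at zero, but any weight that receives a gradient update is generally nonzero afterwards, which is the lack of sparsity discussed above.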

3 Sparse Online Learning 

In this section, we examine several methods for achieving sparsity in online learning. The first idea
is simple coefficient rounding, which is the most natural method. We will then consider its full
online implementation, and another method which is the online counterpart of $L_1$ regularization in
batch learning. As we shall see, all these ideas are closely related.

3.1 Simple Coefficient Rounding 

In order to achieve sparsity, the most natural method is to round small coefficients (that are no
larger than a threshold $\theta > 0$) to zero after every $K$ online steps. That is, if $i/K$ is not an integer,
we use the standard GD rule in (3); if $i/K$ is an integer, we modify the rule as:

$$f(w_i) = T_0(w_i - \eta \nabla_1 L(w_i, z_i), \theta), \tag{4}$$

where for a vector $v = [v_1, \ldots, v_d] \in \mathbb{R}^d$, and a scalar $\theta > 0$,
$T_0(v, \theta) = [T_0(v_1, \theta), \ldots, T_0(v_d, \theta)]$, with

$$T_0(v_j, \theta) = \begin{cases} 0 & \text{if } |v_j| \le \theta \\ v_j & \text{otherwise.} \end{cases}$$

That is, we first perform a standard stochastic gradient descent rule, and then round the updated
coefficients toward zero. The effect is to remove nonzero and small components in the weight vector.

In general, we should not take $K = 1$, especially when $\eta$ is small, since each step modifies $w_i$
by only a small amount. If a coefficient is zero, it remains small after one online update, and the
rounding operation pulls it back to zero. Consequently, rounding can be done only after every $K$
steps (with a reasonably large $K$); in this case, nonzero coefficients have sufficient time to go above
the threshold $\theta$. However, if $K$ is too large, then in the training stage, we will need to keep many
more nonzero features in the intermediate steps before they are rounded to zero. In the extreme
case, we may simply round the coefficients in the end, which does not solve the storage problem in
the training phase. The sensitivity in choosing an appropriate $K$ is a main drawback of this method;
another drawback is the lack of a theoretical guarantee for its online performance.
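
The rounding rule (4) can be sketched in a few lines. The names `round_small` and `rounding_update` are hypothetical, and the gradient is passed in as a plain list so that the sketch does not depend on a particular loss.

```python
from typing import List


def round_small(v: List[float], theta: float) -> List[float]:
    """Operator T_0: set every coefficient with |v_j| <= theta to zero."""
    return [0.0 if abs(vj) <= theta else vj for vj in v]


def rounding_update(w: List[float], grad: List[float], eta: float,
                    theta: float, i: int, K: int) -> List[float]:
    """Gradient step (3), followed by rounding (4) whenever i/K is an integer."""
    v = [wj - eta * gj for wj, gj in zip(w, grad)]
    if i % K == 0:
        v = round_small(v, theta)
    return v
```

With $K = 1$ and a small $\eta$, a coefficient that starts at zero moves by only $\eta$ times its gradient and is rounded straight back, which is the failure mode described above.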

The above mentioned issues motivate us to consider more principled sparse online learning 
methods. In section 3.3, we derive an online version of rounding using an idea called truncated 
gradient for which regret bounds hold. 



3.2 A Sub-gradient Algorithm for $L_1$ Regularization

In our experiments, we combine rounding-in-the-end-of-training with a simple online sub-gradient
method for $L_1$ regularization with a regularization parameter $g > 0$:

$$f(w_i) = w_i - \eta \nabla_1 L(w_i, z_i) - \eta g \, \mathrm{sgn}(w_i), \tag{5}$$

where for a vector $v = [v_1, \ldots, v_d]$, $\mathrm{sgn}(v) = [\mathrm{sgn}(v_1), \ldots, \mathrm{sgn}(v_d)]$, and $\mathrm{sgn}(v_j) = 1$ when $v_j > 0$,
$\mathrm{sgn}(v_j) = -1$ when $v_j < 0$, and $\mathrm{sgn}(v_j) = 0$ when $v_j = 0$. In the experiments, the online method
(5) plus rounding in the end is used as a simple baseline. One should note that this method does
not produce sparse weights online. Therefore it does not handle large-scale problems for which we
cannot keep all features in memory.
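
A small sketch illustrates why rule (5) does not produce sparse weights online. It tracks a single weight whose loss gradient is assumed to be zero (an irrelevant feature), so only the $-\eta g\,\mathrm{sgn}(w)$ term acts. The names `sgn`, `l1_subgradient_update` and `trajectory` are hypothetical.

```python
from typing import List


def sgn(v: float) -> float:
    """Sign function as defined for rule (5)."""
    return 1.0 if v > 0 else (-1.0 if v < 0 else 0.0)


def l1_subgradient_update(w: List[float], grad: List[float],
                          eta: float, g: float) -> List[float]:
    """Rule (5): gradient step plus the L1 sub-gradient term."""
    return [wj - eta * gj - eta * g * sgn(wj) for wj, gj in zip(w, grad)]


def trajectory(w0: float, steps: int, eta: float, g: float) -> List[float]:
    """Iterates of one weight when the loss gradient is identically zero."""
    out, w = [], [w0]
    for _ in range(steps):
        w = l1_subgradient_update(w, [0.0], eta, g)
        out.append(w[0])
    return out
```

Starting from 0.25 with $\eta g = 0.1$, the weight steps down to about 0.05 and then jumps back and forth across zero without ever landing on it, so the coefficient stays small but nonzero.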



3.3 Truncated Gradient 

In order to obtain an online version of the simple rounding rule in (4), we observe that the direct
rounding to zero is too aggressive. A less aggressive version is to shrink the coefficient to zero by a
smaller amount. We call this idea truncated gradient.

The amount of shrinkage is measured by a gravity parameter $g_i \ge 0$:

$$f(w_i) = T_1(w_i - \eta \nabla_1 L(w_i, z_i), \eta g_i, \theta), \tag{6}$$

where for a vector $v = [v_1, \ldots, v_d] \in \mathbb{R}^d$, and scalars $\alpha \ge 0$ and $\theta > 0$,
$T_1(v, \alpha, \theta) = [T_1(v_1, \alpha, \theta), \ldots, T_1(v_d, \alpha, \theta)]$, with

$$T_1(v_j, \alpha, \theta) = \begin{cases} \max(0, v_j - \alpha) & \text{if } v_j \in [0, \theta] \\ \min(0, v_j + \alpha) & \text{if } v_j \in [-\theta, 0] \\ v_j & \text{otherwise.} \end{cases}$$

Again, the truncation can be performed every $K$ online steps. That is, if $i/K$ is not an integer, we
let $g_i = 0$; if $i/K$ is an integer, we let $g_i = Kg$ for a gravity parameter $g > 0$. This particular choice
is equivalent to (4) when we set $g$ such that $\eta K g \ge \theta$. This requires a large $g$ when $\eta$ is small. In
practice, one should set a small, fixed $g$, as implied by our regret bound developed later.

In general, the larger the parameters $g$ and $\theta$ are, the more sparsity is incurred. Due to the
extra truncation $T_1$, this method can lead to sparse solutions, which is confirmed in our experiments
described later. In those experiments, the degree of sparsity discovered varies with the problem.

A special case, which we use in the experiment, is to let $g = \theta$ in (6). In this case, we can
use only one parameter $g$ to control sparsity. Since $\eta K g \ll \theta$ when $\eta K$ is small, the truncation
operation is less aggressive than the rounding in (4). At first sight, the procedure appears to be an
ad-hoc way to fix (4). However, we can establish a regret bound for this method, showing that it is
theoretically sound.
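
As a minimal sketch, the operator $T_1$ in (6) and the every-$K$-steps gravity schedule might be written as below. The names `truncate1`, `truncated_gradient_update` and `gravity` are hypothetical, and the gradient is supplied by the caller.

```python
from typing import List


def truncate1(vj: float, alpha: float, theta: float) -> float:
    """Operator T_1: shrink vj toward zero by alpha, but only if |vj| <= theta."""
    if 0.0 <= vj <= theta:
        return max(0.0, vj - alpha)
    if -theta <= vj <= 0.0:
        return min(0.0, vj + alpha)
    return vj


def truncated_gradient_update(w: List[float], grad: List[float], eta: float,
                              g_i: float, theta: float) -> List[float]:
    """Rule (6): gradient step followed by truncation with amount eta * g_i."""
    return [truncate1(wj - eta * gj, eta * g_i, theta) for wj, gj in zip(w, grad)]


def gravity(i: int, K: int, g: float) -> float:
    """Gravity schedule: K * g on every K-th step, zero otherwise."""
    return K * g if i % K == 0 else 0.0
```

Unlike rounding, a coefficient inside $[-\theta, \theta]$ is moved toward zero by at most $\alpha$ and clipped at zero, so it reaches exactly zero only after the accumulated shrinkage exceeds its magnitude. Passing `float("inf")` as `theta` makes every coefficient subject to shrinkage.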
Another important special case of (6) is setting $\theta = \infty$. This leads to the following update rule
for every $K$-th online step:

$$f(w_i) = T(w_i - \eta \nabla_1 L(w_i, z_i), g_i \eta), \tag{7}$$

where for a vector $v = [v_1, \ldots, v_d] \in \mathbb{R}^d$, and a scalar $\alpha \ge 0$,
$T(v, \alpha) = [T(v_1, \alpha), \ldots, T(v_d, \alpha)]$, with

$$T(v_j, \alpha) = \begin{cases} \max(0, v_j - \alpha) & \text{if } v_j > 0 \\ \min(0, v_j + \alpha) & \text{otherwise.} \end{cases}$$

The method is a modification of the standard sub-gradient online method for $L_1$ regularization in
(5). The parameter $g_i \ge 0$ controls the sparsity that can be achieved with the algorithm. Note
that when $g_i = 0$, the update rule is identical to the standard stochastic gradient descent rule. In
general, we may perform a truncation every $K$ steps. That is, if $i/K$ is not an integer, we let $g_i = 0$;
if $i/K$ is an integer, we let $g_i = Kg$ for a gravity parameter $g > 0$. The reason for doing so (instead
of a constant $g$) is that we can perform a more aggressive truncation with gravity parameter $Kg$
after each $K$ steps. This can potentially lead to better sparsity.

The procedure in (7) can be regarded as an online counterpart of $L_1$ regularization in the sense
that it approximately solves an $L_1$ regularization problem in the limit of $\eta \to 0$. Truncated gradient
descent for $L_1$ regularization is different from the naive application of the stochastic gradient descent
rule (3) with an added $L_1$ regularization term. As pointed out in the introduction, the latter fails
because it rarely leads to sparsity. Our theory shows that even with sparsification, the prediction
performance is still comparable to that of the standard online learning algorithm. In the following,
we develop a general regret bound for this general method, which also shows how the regret may
depend on the sparsification parameter $g$.

3.4 Regret Analysis

Throughout the paper, we use $\|\cdot\|_1$ for 1-norm, and $\|\cdot\|$ for 2-norm. For reference, we make the
following assumption regarding the loss function:

Assumption 3.1 We assume that $L(w, z)$ is convex in $w$, and there exist non-negative constants
$A$ and $B$ such that $(\nabla_1 L(w, z))^2 \le A L(w, z) + B$ for all $w \in \mathbb{R}^d$ and $z \in \mathbb{R}^{d+1}$.

For linear prediction problems, we have a general loss function of the form $L(w, z) = \phi(w^T x, y)$.
The following are some common loss functions $\phi(\cdot, \cdot)$ with corresponding choices of parameters $A$
and $B$ (which are not unique), under the assumption that $\sup_x \|x\| \le C$.

• Logistic: $\phi(p, y) = \ln(1 + \exp(-py))$; $A = 0$ and $B = C^2$. This loss is for binary classification
problems with $y \in \{\pm 1\}$.

• SVM (hinge loss): $\phi(p, y) = \max(0, 1 - py)$; $A = 0$ and $B = C^2$. This loss is for binary
classification problems with $y \in \{\pm 1\}$.

• Least squares (square loss): $\phi(p, y) = (p - y)^2$; $A = 4C^2$ and $B = 0$. This loss is for regression
problems.

Our main result is Theorem 3.1 that is parameterized by $A$ and $B$. The proof is left to the
appendix. Specializing it to particular losses yields several corollaries. The one that can be applied
to the least squares loss will be given later in Corollary 4.1.
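
Theorem 3.1 (Sparse Online Regret) Consider sparse online update rule (7) with $w_1 = 0$ and
$\eta > 0$. If Assumption 3.1 holds, then for all $\bar{w} \in \mathbb{R}^d$ we have

$$\frac{1 - 0.5A\eta}{T} \sum_{i=1}^{T} \left[ L(w_i, z_i) + \frac{g_i}{1 - 0.5A\eta} \|w_{i+1} \cdot I(|w_{i+1}| \le \theta)\|_1 \right] \le \frac{\eta}{2} B + \frac{\|\bar{w}\|^2}{2\eta T} + \frac{1}{T} \sum_{i=1}^{T} \left[ L(\bar{w}, z_i) + g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 \right],$$

where for vectors $v = [v_1, \ldots, v_d]$ and $v' = [v'_1, \ldots, v'_d]$, we let

$$\|v \cdot I(|v'| \le \theta)\|_1 = \sum_{j=1}^{d} |v_j| \, I(|v'_j| \le \theta),$$

where $I(\cdot)$ is the set indicator function.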

We state the theorem with a constant learning rate $\eta$. As mentioned earlier, it is possible to
obtain a result with variable learning rate where $\eta = \eta_i$ decays as $i$ increases. Although this may
lead to a no-regret bound without knowing $T$ in advance, it introduces extra complexity to the
presentation of the main idea. Since our focus is on sparsity rather than optimizing the learning rate,
we do not include such a result for clarity. If $T$ is known in advance, then in the above bound, one
can simply take $\eta = O(1/\sqrt{T})$ and the regret is of order $O(1/\sqrt{T})$.
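
As a worked illustration of the $O(1/\sqrt{T})$ rate, suppose (as an assumption about the shape of the bound, in line with the terms that appear in Theorem 3.2) that the part of the bound depending on the learning rate is $\frac{\eta}{2}B + \frac{\|\bar{w}\|^2}{2\eta T}$ with $B > 0$, and ignore the factor $1 - 0.5A\eta$. Minimizing over $\eta$ then gives:

```latex
\frac{d}{d\eta}\left(\frac{\eta}{2}B + \frac{\|\bar{w}\|^2}{2\eta T}\right)
  = \frac{B}{2} - \frac{\|\bar{w}\|^2}{2\eta^2 T} = 0
\quad\Longrightarrow\quad
\eta^{*} = \frac{\|\bar{w}\|}{\sqrt{BT}},
\qquad
\frac{\eta^{*}}{2}B + \frac{\|\bar{w}\|^2}{2\eta^{*} T}
  = \|\bar{w}\|\sqrt{\frac{B}{T}}.
```

Both the minimizing learning rate and the resulting value scale as $1/\sqrt{T}$, which matches the orders stated above.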

In the above theorem, the right-hand side involves a term $g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1$ that depends on
$w_{i+1}$, which is not easily estimated. To remove this dependency, a trivial upper bound of $\theta = \infty$
can be used, leading to the $L_1$ penalty $g_i \|\bar{w}\|_1$. In the general case of $\theta < \infty$, we cannot remove the
$w_{i+1}$ dependency because the effective regularization condition (as shown on the left-hand side) is
the non-convex penalty $g_i \|w \cdot I(|w| \le \theta)\|_1$. Solving such a non-convex formulation is hard both in
the online and batch settings. In general, we only know how to efficiently discover a local minimum
which is difficult to characterize. Without a good characterization of the local minimum, it is not
possible for us to replace $g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1$ on the right-hand side by $g_i \|\bar{w} \cdot I(|\bar{w}| \le \theta)\|_1$ because
such a formulation would have implied that we could efficiently solve a non-convex problem with a
simple online update rule. Still, when $\theta < \infty$, one naturally expects that the right-hand side penalty
$g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1$ is much smaller than the corresponding $L_1$ penalty $g_i \|\bar{w}\|_1$, especially when $w_j$
has many components that are close to 0. Therefore the situation with $\theta < \infty$ can potentially yield
better performance on some data. This is confirmed in our experiments.

Theorem 3.1 also implies a trade-off between sparsity and regret performance. We may simply
consider the case where $g_i = g$ is a constant. When $g$ is small, we have less sparsity but the regret
term $g \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 \le g \|\bar{w}\|_1$ on the right-hand side is also small. When $g$ is large, we are able
to achieve more sparsity but the regret $g \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1$ on the right-hand side also becomes
large. Such a trade-off (sparsity versus prediction accuracy) is empirically studied in Section 6. Our
observation suggests that we can gain significant sparsity with only a small decrease of accuracy
(that is, using a small $g$).
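
The effect of the gravity parameter on sparsity can be seen on a toy problem. The sketch below runs one pass of update (7) with square loss, $K = 1$ and $\theta = \infty$ over four hand-made examples in which only the first feature is informative. The data set `TOY` and the names `truncate`, `one_pass` and `count_nonzero` are hypothetical and chosen only for illustration.

```python
from typing import List, Sequence, Tuple

Example = Tuple[List[float], float]

# Feature 0 always equals the label; features 1 and 2 are uninformative.
TOY: List[Example] = [
    ([1.0, 1.0, 0.0], 1.0),
    ([1.0, 0.0, 1.0], 1.0),
    ([1.0, -1.0, 0.0], 1.0),
    ([1.0, 0.0, -1.0], 1.0),
]


def truncate(vj: float, alpha: float) -> float:
    """Operator T from (7), that is, T_1 with theta = infinity."""
    return max(0.0, vj - alpha) if vj > 0 else min(0.0, vj + alpha)


def one_pass(examples: Sequence[Example], d: int, eta: float, g: float) -> List[float]:
    """One pass of rule (7) for square loss, truncating after every step."""
    w = [0.0] * d
    for x, y in examples:
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        w = [truncate(wj - eta * 2.0 * err * xj, eta * g) for wj, xj in zip(w, x)]
    return w


def count_nonzero(w: Sequence[float]) -> int:
    """Number of weights that are not exactly zero."""
    return sum(1 for wj in w if wj != 0.0)
```

With $\eta = 0.1$, calling `count_nonzero(one_pass(TOY, 3, 0.1, g))` for increasing values of `g` shows fewer surviving weights; a very large `g` removes every weight, including the informative one, which is the accuracy side of the trade-off.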

Now consider the case $\theta = \infty$ and $g_i = g$. When $T \to \infty$, if we let $\eta \to 0$ and $\eta T \to \infty$, then
Theorem 3.1 implies that

$$\frac{1}{T} \sum_{i=1}^{T} \left[ L(w_i, z_i) + g \|w_i\|_1 \right] \le \inf_{\bar{w} \in \mathbb{R}^d} \left[ \frac{1}{T} \sum_{i=1}^{T} L(\bar{w}, z_i) + g \|\bar{w}\|_1 \right] + o(1).$$

In other words, if we let $L'(w, z) = L(w, z) + g\|w\|_1$ be the $L_1$ regularized loss, then the $L_1$ regularized
regret is small when $\eta \to 0$ and $T \to \infty$. This implies that our procedure can be regarded as the
online counterpart of $L_1$-regularization methods. In the stochastic setting where the examples are
drawn iid from some underlying distribution, the sparse online gradient method proposed in this
paper solves the $L_1$ regularization problem.



3.5 Stochastic Setting 

Stochastic-gradient-based online learning methods can be used to solve large-scale batch optimiza- 
tion problems, often quite successfully [11, 14]. In this setting, we can go through training examples 
one-by-one in an online fashion, and repeat multiple times over the training data. In this section, 
we analyze the performance of such a procedure using Theorem 3.1. 

To simplify the analysis, instead of assuming that we go through the data one by one, we assume 
that each additional data point is drawn from the training data randomly with equal probability. 
This corresponds to the standard stochastic optimization setting, in which observed samples are iid 
from some underlying distributions. The following result is a simple consequence of Theorem 3.1. 
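For simplicity, we only consider the case with $\theta = \infty$ and constant gravity $g_i = g$.

Theorem 3.2 Consider a set of training data $z_i = (x_i, y_i)$ for $i = 1, \ldots, n$, and let

$$R(w, g) = \frac{1}{n} \sum_{i=1}^{n} L(w, z_i) + g \|w\|_1$$

be the $L_1$ regularized loss over training data. Let $\hat{w}_1 = w_1 = 0$, and define recursively for
$t = 1, 2, \ldots$

$$w_{t+1} = T(w_t - \eta \nabla_1 L(w_t, z_{i_t}), g\eta), \qquad \hat{w}_{t+1} = \hat{w}_t + (w_{t+1} - \hat{w}_t)/(t + 1),$$

where each $i_t$ is drawn from $\{1, \ldots, n\}$ uniformly at random. If Assumption 3.1 holds, then at any
time $T$, the following inequalities are valid for all $\bar{w} \in \mathbb{R}^d$:

$$\mathbf{E}_{i_1, \ldots, i_T} \left[ (1 - 0.5A\eta) \, R\!\left(\hat{w}_T, \frac{g}{1 - 0.5A\eta}\right) \right] \le \mathbf{E}_{i_1, \ldots, i_T} \left[ \frac{1 - 0.5A\eta}{T} \sum_{t=1}^{T} R\!\left(w_t, \frac{g}{1 - 0.5A\eta}\right) \right] \le \frac{\eta}{2} B + \frac{\|\bar{w}\|^2}{2\eta T} + R(\bar{w}, g).$$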



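PROOF. Note that the recursion of $\hat{w}_t$ implies that

$$\hat{w}_T = \frac{1}{T} \sum_{t=1}^{T} w_t$$

from telescoping the update rule.

Because $R(w, g)$ is convex in $w$, the first inequality follows directly from Jensen's inequality.
In the following we only need to prove the second inequality. Theorem 3.1 implies the following:

$$\frac{1 - 0.5A\eta}{T} \sum_{t=1}^{T} \left[ L(w_t, z_{i_t}) + \frac{g}{1 - 0.5A\eta} \|w_t\|_1 \right] \le \frac{\eta}{2} B + \frac{\|\bar{w}\|^2}{2\eta T} + \frac{1}{T} \sum_{t=1}^{T} \left[ L(\bar{w}, z_{i_t}) + g \|\bar{w}\|_1 \right]. \tag{8}$$

Observe that

$$\mathbf{E}_{i_t} \left[ L(w_t, z_{i_t}) + \frac{g}{1 - 0.5A\eta} \|w_t\|_1 \right] = R\!\left(w_t, \frac{g}{1 - 0.5A\eta}\right)$$

and

$$\mathbf{E}_{i_1, \ldots, i_T} \left[ \frac{1}{T} \sum_{t=1}^{T} \left( L(\bar{w}, z_{i_t}) + g \|\bar{w}\|_1 \right) \right] = R(\bar{w}, g).$$

The second inequality is obtained by taking the expectation with respect to $\mathbf{E}_{i_1, \ldots, i_T}$ in (8). □

If we let $\eta \to 0$ and $\eta T \to \infty$, the bound in Theorem 3.2 becomes

$$\mathbf{E}\left[ R(\hat{w}_T, g) \right] \le \mathbf{E}\left[ \frac{1}{T} \sum_{t=1}^{T} R(w_t, g) \right] \le \inf_{\bar{w}} R(\bar{w}, g) + o(1).$$

That is, on average $\hat{w}_T$ approximately solves the $L_1$ regularization problem

$$\inf_{w} \left[ \frac{1}{n} \sum_{i=1}^{n} L(w, z_i) + g \|w\|_1 \right].$$

If we choose a random stopping time $T$, then the above inequalities say that on average $R(w_T)$
also solves this $L_1$ regularization problem approximately. Therefore in our experiment, we use the
last solution $w_T$ instead of the aggregated solution $\hat{w}_T$.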

Since 1-norm regularization is frequently used to achieve sparsity in the batch learning setting, 
the connection to 1-norm regularization can be regarded as an alternative justification for the 
sparse-online algorithm developed in this paper. 
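
The stochastic procedure analyzed in this section (sample a training example uniformly at random, apply the truncated update with $\theta = \infty$, and keep a running average of the iterates) can be sketched as follows for square loss. The function name `stochastic_truncated_gradient` and the returned `history` list are hypothetical conveniences for inspection, not part of the analysis.

```python
import random
from typing import List, Sequence, Tuple

Example = Tuple[List[float], float]


def truncate(vj: float, alpha: float) -> float:
    """Shrink vj toward zero by alpha, clipping at zero (theta = infinity)."""
    return max(0.0, vj - alpha) if vj > 0 else min(0.0, vj + alpha)


def stochastic_truncated_gradient(data: Sequence[Example], d: int, eta: float,
                                  g: float, T: int, seed: int = 0):
    """Return (w_T, averaged iterate, [w_1, ..., w_T]) for square loss."""
    rng = random.Random(seed)
    w = [0.0] * d          # w_1 = 0
    w_avg = [0.0] * d      # averaged iterate starts at w_1
    history = [list(w)]
    for t in range(1, T):
        x, y = data[rng.randrange(len(data))]   # i_t drawn uniformly at random
        err = sum(wj * xj for wj, xj in zip(w, x)) - y
        w = [truncate(wj - eta * 2.0 * err * xj, eta * g) for wj, xj in zip(w, x)]
        # running average: avg_{t+1} = avg_t + (w_{t+1} - avg_t) / (t + 1)
        w_avg = [aj + (wj - aj) / (t + 1) for aj, wj in zip(w_avg, w)]
        history.append(list(w))
    return w, w_avg, history
```

The running average equals the plain mean of the stored iterates, and it is generally dense even when the last iterate is sparse, which is one practical reason to prefer the last iterate.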



4 Truncated Gradient Algorithm for Least Squares 

The method in Section 3 can be directly applied to least squares regression. This leads to Algorithm
1 which implements sparsification for square loss according to equation (7). In the description, we
use the superscripted symbol $w^j$ to denote the $j$-th component of vector $w$ (in order to differentiate
from $w_i$, which we have used to denote the $i$-th weight vector). For clarity, we also drop the index
$i$ from $w_i$. Although we keep the choice of gravity parameters $g_i$ open in the algorithm description,
in practice, we only consider the following choice:

$$g_i = \begin{cases} Kg & \text{if } i/K \text{ is an integer} \\ 0 & \text{otherwise.} \end{cases}$$

This may give a more aggressive truncation (thus sparsity) after every $K$-th iteration. Since we do
not have a theorem formalizing how much more sparsity one can gain from this idea, its effect will
only be examined through experiments.

In many online learning situations (such as web applications), only a small subset of the features
have nonzero values for any example $x$. It is thus desirable to deal with sparsity only in this
small subset rather than all features, while simultaneously inducing sparsity on all feature weights.
Moreover, it is important to store only features with non-zero coefficients (if the number of features
is so large that it cannot be stored in memory, this approach allows us to use a hashtable to track
only the nonzero coefficients). We describe how this can be implemented efficiently in the next
section.

Algorithm 1 Truncated Gradient
Inputs:

• threshold $\theta \ge 0$

• gravity sequence $g_i \ge 0$

• learning rate $\eta \in (0, 1)$

• example oracle $O$

initialize weights $w^j \leftarrow 0$ ($j = 1, \ldots, d$)
for trial $i = 1, 2, \ldots$

1. Acquire an unlabeled example $x = [x^1, x^2, \ldots, x^d]$ from oracle $O$

2. forall weights $w^j$ ($j = 1, \ldots, d$)

(a) if $w^j > 0$ and $w^j \le \theta$ then $w^j \leftarrow \max\{w^j - g_i \eta, 0\}$

(b) elseif $w^j < 0$ and $w^j \ge -\theta$ then $w^j \leftarrow \min\{w^j + g_i \eta, 0\}$

3. Compute prediction: $\hat{y} = \sum_j w^j x^j$

4. Acquire the label $y$ from oracle $O$

5. Update weights for all features $j$: $w^j \leftarrow w^j + 2\eta(y - \hat{y})x^j$
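
A direct, dense transcription of the steps of Algorithm 1 might look as follows. This is a sketch for illustration only: the class name `TruncatedGradientLearner` and its `learn` method are hypothetical, the oracle is replaced by passing `x` and `y` as arguments, and every weight is visited on every trial, so it does not have the per-step cost discussed in the next section.

```python
import math
from typing import Callable, List


class TruncatedGradientLearner:
    """Dense sketch of Algorithm 1 for square loss."""

    def __init__(self, d: int, eta: float, theta: float = math.inf,
                 gravity: Callable[[int], float] = lambda i: 0.0) -> None:
        self.w = [0.0] * d        # initialize weights w^j <- 0
        self.eta = eta
        self.theta = theta
        self.gravity = gravity    # maps the trial index i to g_i
        self.i = 0

    def _truncate(self, g_i: float) -> None:
        """Step 2: shrink every weight inside [-theta, theta] toward zero."""
        shrink = g_i * self.eta
        for j, wj in enumerate(self.w):
            if 0.0 < wj <= self.theta:
                self.w[j] = max(wj - shrink, 0.0)
            elif -self.theta <= wj < 0.0:
                self.w[j] = min(wj + shrink, 0.0)

    def learn(self, x: List[float], y: float) -> float:
        """Run one trial and return the prediction made before y is used."""
        self.i += 1
        self._truncate(self.gravity(self.i))                  # step 2
        y_hat = sum(wj * xj for wj, xj in zip(self.w, x))     # step 3
        for j, xj in enumerate(x):                            # step 5
            self.w[j] += 2.0 * self.eta * (y - y_hat) * xj
        return y_hat
```

For the schedule used in practice, one could pass `gravity=lambda i: K * g if i % K == 0 else 0.0` with chosen values of `K` and `g`.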

For reference, we present a specialization of Theorem 3.1 in the following corollary that is directly
applicable to Algorithm 1.

Corollary 4.1 (Sparse Online Square Loss Regret) If there exists $C > 0$ such that for all $x$, $\|x\| \le C$,
then for all $\bar{w} \in \mathbb{R}^d$, we have

$$\frac{1 - 2C^2\eta}{T} \sum_{i=1}^{T} \left[ (w_i^T x_i - y_i)^2 + \frac{g_i}{1 - 2C^2\eta} \|w_i \cdot I(|w_i| \le \theta)\|_1 \right] \le \frac{\|\bar{w}\|^2}{2\eta T} + \frac{1}{T} \sum_{i=1}^{T} \left[ (\bar{w}^T x_i - y_i)^2 + g_{i+1} \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 \right],$$

where $w_i = [w^1, \ldots, w^d] \in \mathbb{R}^d$ is the weight vector used for prediction at the $i$-th step of Algorithm 1;
$(x_i, y_i)$ is the data point observed at the $i$-th step.

This corollary explicitly states that the average square loss incurred by the learner (left term) is
bounded by the average square loss of the best weight vector $\bar{w}$, plus a term related to the size of
$\bar{w}$ which decays as $1/T$ and an additive offset controlled by the sparsity threshold $\theta$ and the gravity
parameter $g_i$.

5 Efficient Implementation

We altered a standard gradient-descent implementation (Vowpal Wabbit [7]) according to Algorithm
1. Vowpal Wabbit optimizes square loss on a linear representation $w \cdot x$ via gradient descent (3) with
a couple of caveats:

1. The prediction is normalized by the square root of the number of nonzero entries in a sparse
vector, $w \cdot x / \|x\|_0^{0.5}$. This alteration is just a constant rescaling on dense vectors which is
effectively removable by an appropriate rescaling of the learning rate.

2. The prediction is clipped to the interval $[0, 1]$, implying that the loss function is not square
loss for unclipped predictions outside of this dynamic range. Instead the update is a constant
value, equivalent to the gradient of a linear loss function.

The learning rate in Vowpal Wabbit is controllable, supporting $1/i$ decay as well as a constant
learning rate (and rates in-between). The program operates in an entirely online fashion, so the
memory footprint is essentially just the weight vector, even when the amount of data is very large.

As mentioned earlier, we would like the algorithm's computational complexity to depend linearly
on the number of nonzero features of an example, rather than the total number of features. The
approach we took was to store a time-stamp $\tau_j$ for each feature $j$. The time-stamp was initialized
to the index of the example where feature $j$ was nonzero for the first time. During online learning,
we simply went through all nonzero features $j$ of example $i$, and could "simulate" the shrinkage of
$w^j$ after $\tau_j$ in a batch mode. These weights are then updated, and their time stamps are reset to
$i$. This lazy-update idea of delaying the shrinkage calculation until needed is the key to efficient
implementation of truncated gradient. Specifically, instead of using update rule (6) for weight $w^j$,
we shrunk the weights of all nonzero feature $j$ differently by the following:

$$f(w^j) = T_1\!\left( w^j + 2\eta(y - \hat{y})x^j, \left\lfloor \frac{i - \tau_j}{K} \right\rfloor K \eta g, \theta \right),$$

and $\tau_j$ is updated by

$$\tau_j \leftarrow \tau_j + \left\lfloor \frac{i - \tau_j}{K} \right\rfloor K.$$

We note that such a lazy-update trick by maintaining the time-stamp information can be applied
to the other two algorithms given in section 3. In the coefficient rounding algorithm (4), for instance,
for each nonzero feature $j$ of example $i$, we can first perform a regular gradient descent on the square
loss, and then do the following: if $|w^j|$ is below the threshold $\theta$ and $i \ge \tau_j + K$, we round $w^j$ to 0
and set $\tau_j$ to $i$.
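
One way to realize the lazy-update idea is sketched below. It is a variant written for illustration, not the Vowpal Wabbit code: weights and time-stamps live in dictionaries keyed by feature index, the shrinkage owed since the last time-stamp is applied before the prediction, and the count of owed truncations is taken to be the number of multiples of $K$ between the time-stamp and the current trial. The class name `LazyTruncatedGradient` and the sparse example format (a list of `(index, value)` pairs without repeated indices) are assumptions.

```python
import math
from typing import Dict, List, Tuple

SparseExample = List[Tuple[int, float]]  # (feature index, nonzero value) pairs


class LazyTruncatedGradient:
    """Sketch: truncated gradient for square loss with per-feature time-stamps."""

    def __init__(self, eta: float, g: float, K: int = 1,
                 theta: float = math.inf) -> None:
        self.eta, self.g, self.K, self.theta = eta, g, K, theta
        self.w: Dict[int, float] = {}    # only nonzero weights are stored
        self.tau: Dict[int, int] = {}    # trial at which each weight was last touched
        self.i = 0

    def _catch_up(self, j: int) -> float:
        """Apply all truncations owed to feature j since its time-stamp."""
        wj = self.w.get(j, 0.0)
        owed = self.i // self.K - self.tau.get(j, self.i) // self.K
        shrink = owed * self.K * self.g * self.eta
        if 0.0 < wj <= self.theta:
            wj = max(wj - shrink, 0.0)
        elif -self.theta <= wj < 0.0:
            wj = min(wj + shrink, 0.0)
        return wj

    def learn(self, x: SparseExample, y: float) -> float:
        """One trial; the cost is linear in the number of nonzero features of x."""
        self.i += 1
        current = {j: self._catch_up(j) for j, _ in x}
        y_hat = sum(current[j] * xj for j, xj in x)
        for j, xj in x:
            wj = current[j] + 2.0 * self.eta * (y - y_hat) * xj
            if wj != 0.0:
                self.w[j], self.tau[j] = wj, self.i   # insert or refresh
            else:
                self.w.pop(j, None)                   # delete zero weights
                self.tau.pop(j, None)
        return y_hat
```

Under these assumptions the predictions agree, up to floating-point rounding, with a dense loop that truncates every weight on every $K$-th trial, because a weight whose feature is absent changes only through truncation. Weights of features that have not appeared recently are stale until their next appearance, so a final catch-up pass would be needed before exporting the model.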

This implementation shows that the truncated gradient method satisfies the following require- 
ments needed for solving large scale problems with sparse features. 

• The algorithm is computationally efficient: the number of operations per online step is linear 
in the number of nonzero features, and independent of the total number of features. 

• The algorithm is memory efficient: it maintains a list of active features, and a feature can be 
inserted when observed, and deleted when the corresponding weight becomes zero. 

If we apply the online projection idea in [15] to solve (1), then in the update rule (7), one
has to pick the smallest $g_i \ge 0$ such that $\|w_{i+1}\|_1 \le s$. We do not know an efficient method
to find this specific $g_i$ using operations independent of the total number of features. A standard
implementation relies on sorting all weights, which requires $O(d \ln d)$ operations, where $d$ is the total
number of (nonzero) features. This complexity is unacceptable for our purpose. However, we shall
point out that in an important recent work [5], the authors proposed an efficient online $\ell_1$-projection
method. The idea is to use a balanced tree to keep track of weights, which allows efficient threshold
finding and tree updates in $O(k \ln d)$ operations on average (here, $k$ denotes the number of nonzero
coefficients in the current training example). Although the algorithm still has weak dependency on
$d$, it is applicable to large scale practical applications. The theoretical analysis presented in this
paper shows that we can obtain a meaningful regret bound by picking an arbitrary $g_i$. This is useful
because the resulting method is much simpler to implement and is computationally more efficient
per online step. Moreover, our method allows non-convex updates that are closely related to the
simple coefficient rounding idea. Due to the complexity of implementing the balanced tree strategy
in [5], we shall not compare to it in this paper.
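
For comparison, the sort-based projection mentioned above can be sketched as follows. This is the commonly used textbook routine for Euclidean projection onto an $L_1$ ball, written here as an illustration and not taken from [15] or [5]; the function name `project_l1_ball` is hypothetical. The returned vector is a soft-thresholded copy of the input, and the threshold plays the role of the shrinkage amount $g_i \eta$.

```python
import math
from typing import List


def project_l1_ball(v: List[float], s: float) -> List[float]:
    """Euclidean projection of v onto {w : ||w||_1 <= s}, using a full sort."""
    if s <= 0:
        raise ValueError("radius s must be positive")
    if sum(abs(vj) for vj in v) <= s:
        return list(v)                       # already inside the ball
    u = sorted((abs(vj) for vj in v), reverse=True)   # O(d ln d) step
    cumsum, shift = 0.0, 0.0
    for k, uk in enumerate(u, start=1):
        cumsum += uk
        candidate = (cumsum - s) / k
        if uk - candidate > 0:               # the k largest entries stay nonzero
            shift = candidate
    # soft-threshold every coordinate by the shift that was found
    return [math.copysign(max(abs(vj) - shift, 0.0), vj) for vj in v]
```

The sort touches every stored weight on every step, which is the dependence on $d$ that the truncated gradient update avoids.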

6 Empirical Results 

We applied Vowpal Wabbit with the efficiently implemented sparsify option, as described in the
previous section, to a selection of datasets, including eleven datasets from the UCI repository [1],
the much larger dataset rcv1 [8], and a private large-scale dataset Big_Ads related to ad interest
prediction. While UCI datasets are useful for benchmark purposes, rcv1 and Big_Ads are more
interesting since they embody real-world datasets with large numbers of features, many of which
are less informative for making predictions than others. The datasets are summarized in Table 1.

The UCI datasets we used do not have many features, and it is expected that a large fraction
of these features are useful for making predictions. For comparison purposes as well as to better
demonstrate the behavior of our algorithm, we also added 1000 random binary features to those
datasets. Each feature has value 1 with probability 0.05 and 0 otherwise.
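
The augmentation with random binary features is simple to reproduce in spirit. The sketch below appends independent Bernoulli(0.05) features to one example; the function name `add_random_binary_features` and the choice of random number generator are assumptions, since the exact procedure used for the experiments is not specified beyond the description above.

```python
import random
from typing import List, Optional


def add_random_binary_features(x: List[float], n_extra: int = 1000, p: float = 0.05,
                               rng: Optional[random.Random] = None) -> List[float]:
    """Append n_extra features, each equal to 1 with probability p and 0 otherwise."""
    rng = rng or random.Random()
    return list(x) + [1.0 if rng.random() < p else 0.0 for _ in range(n_extra)]
```

Because the extra features are independent of the label, a sparsification method that works as intended should assign most of them a weight of exactly zero.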

6.1 Feature Sparsification of Truncated Gradient Descent 

In the first set of experiments, we are interested in how much reduction in the number of features 
is possible without affecting learning performance significantly; specifically, we require the accuracy 
be reduced by no more than 1% for classification tasks, and the total square loss be increased by 
no more than 1% for regression tasks. As is common practice, we allowed the algorithm to run on the 
training data set for multiple passes with decaying learning rate. For each dataset, we performed 
10-fold cross validation over the training set to identify the best set of parameters, including the 
learning rate $\eta$, the sparsification rate $g$, the number of passes over the training set, and the decay of 
the learning rate across these passes. This set of parameters was then used to train Vowpal Wabbit on 
the whole training set. Finally, the learned classifier/regressor was evaluated on the test set. We fixed 
$K = 1$ and $\theta = \infty$ in these experiments, and will study the effects of $K$ and $\theta$ in later subsections. 
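The selection criterion can be sketched as a small filter over cross-validation results. This is only an illustration under stated assumptions: the 1% tolerance is read here as an absolute drop in accuracy, the helper name `pick_sparsest` is hypothetical, and the candidate numbers are made up rather than taken from the experiments.

```python
def pick_sparsest(candidates, baseline_accuracy, max_drop=0.01):
    """Pick the candidate with the fewest features whose accuracy is within max_drop.

    candidates: list of (gravity, cv_accuracy, n_features) tuples.
    Returns None if no candidate satisfies the accuracy constraint.
    """
    admissible = [c for c in candidates if c[1] >= baseline_accuracy - max_drop]
    if not admissible:
        return None
    return min(admissible, key=lambda c: c[2])

# Illustrative (made-up) cross-validation results for one dataset.
candidates = [
    (0.0, 0.965, 1411),   # no sparsification
    (0.1, 0.960, 600),
    (0.5, 0.956, 300),
    (1.0, 0.940, 50),     # too aggressive: accuracy drops by more than 1%
]
best = pick_sparsest(candidates, baseline_accuracy=0.965)
```

For regression tasks the same filter would apply with total square loss in place of accuracy and the inequality reversed.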

Figure 1 shows the fraction of reduced features after sparsification is applied to each dataset. 
For UCI datasets with randomly added features, Vowpal Wabbit is able to reduce the number of 
features by a fraction of more than 90%, except for the ad dataset in which only 71% reduction 
is observed. This less satisfying result might be improved by a more extensive parameter search 

Table 1: Dataset Summary. 

Dataset     #features    #train data   #test data   task
ad          1411         2455          824          classification
crx         47           526           164          classification
housing     14           381           125          regression
krvskp      74           2413          783          classification
magic04     11           14226         4794         classification
mushroom    117          6079          2045         classification
spambase    58           3445          1156         classification
wbc         10           520           179          classification
wdbc        31           421           148          classification
wpbc        33           153           45           classification
zoo         17           77            24           regression
rcv1        38853        781265        23149        classification
Big_Ads     3 x 10^9     26 x 10^6     2.7 x 10^6   classification

in cross validation. However, if we can tolerate 1.3% decrease in accuracy (instead of 1% as for 
other datasets) during cross validation, Vowpal Wabbit is able to achieve 91.4% reduction, indicating 
that a large reduction is still possible at the tiny additional cost of 0.3% accuracy loss. With this 
slightly more aggressive sparsification, the test-set accuracy drops from 95.9% (when only 1% loss 
in accuracy is allowed in cross validation) to 95.4%, while the accuracy without sparsification is 
96.5%. 

Even for the original UCI datasets without artificially added features, Vowpal Wabbit manages 
to filter out some of the less useful features while maintaining the same level of performance. For 
example, for the ad dataset, a reduction of 83.4% is achieved. Compared to the results above, it 
seems the most effective feature reductions occur on datasets with a large number of less useful 
features, exactly where sparsification is needed. 

For rcv1, more than 75% of features are removed after the sparsification process, indicating the 
effectiveness of our algorithm in real-life problems. We were not able to try many parameters in 
cross validation because of the size of rcv1. It is expected that more reduction is possible when a 
more thorough parameter search is performed. 

The previous results do not exercise the full power of the approach presented here because 
they are applied to datasets where standard Lasso regularization [13] is or may be computationally 
viable. We have also applied this approach to a large non-public dataset Big_Ads where the goal is 
predicting which of two ads was clicked on given context information (the content of ads and query 
information). Here, accepting a 0.009 increase in classification error allows us to reduce the number 
of features from about $3 \times 10^9$ to about $24 \times 10^6$, a factor of 125 decrease in the number of features. 
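As a quick check of the quoted factor, using the rounded feature counts given above:

```latex
\[
\frac{3 \times 10^{9}}{24 \times 10^{6}} = \frac{3000}{24} = 125 .
\]
```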

Figure 1: A plot showing the amount of features left after sparsification for each dataset. The 
first result is the fraction of features left when the performance is changed by at most 1% due to 
sparsification. The second result is the % sparsification when 1000 random features are added to 
each example. For rcv1 and Big_Ads there is no second column, since the experiment is not useful. 

For classification tasks, we also study how our sparsification solution affects AUC (Area Under 
the ROC Curve), which is a standard metric for classification.¹ Using the same sets of parameters 
from 10-fold cross validation described above, we find that the criterion is not affected significantly 
by sparsification and in some cases is actually slightly improved. The reason may be that our 
sparsification method removes some of the features that could have confused Vowpal Wabbit. The 
ratios of the AUC with and without sparsification for all classification tasks are plotted in Figure 2. 
It is often the case that these ratios are above 98%. 

¹ We use AUC here and in later subsections because, unlike accuracy, it is insensitive to threshold. 
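For readers who want to reproduce the AUC ratio, a minimal sketch follows. It computes AUC through its rank interpretation (the probability that a random positive example is scored above a random negative one); the labels and scores are toy values invented for illustration, not experimental data.

```python
def auc(labels, scores):
    """AUC as P(score of a positive > score of a negative), ties counted as 1/2."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y != 1]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores (hypothetical) for a dense and a sparsified model.
labels = [1, 1, 0, 1, 0, 0]
dense_scores = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
sparse_scores = [0.9, 0.7, 0.8, 0.6, 0.3, 0.1]

ratio = auc(labels, sparse_scores) / auc(labels, dense_scores)

# AUC depends only on the ordering of scores, so it is unchanged by any
# strictly increasing rescaling; this is the threshold insensitivity noted above.
rescaled = [2.0 * s + 1.0 for s in dense_scores]
```

The quadratic pairwise loop is fine for small test sets; a sort-based rank computation would be preferable at the scale of the larger datasets.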

6.2 The Effects of K 

As we argued before, using a $K$ value larger than 1 may be desirable in truncated gradient and the 
rounding algorithms. This advantage is empirically demonstrated here. In particular, we try $K = 1$, 
$K = 10$, and $K = 20$ in both algorithms. As before, cross validation is used to select parameters 
in the rounding algorithm, including the learning rate $\eta$, the number of passes over the data during training, and 
the learning rate decay over training passes. 

Figures 3 and 4 give the AUC vs. number-of-features plots, where each data point is generated 
by running the respective algorithm using a different value of $g$ (for truncated gradient) and $\theta$ (for the 
rounding algorithm). We used $\theta = \infty$ in truncated gradient. 

For truncated gradient, the performances with $K = 10$ or 20 are at least as good as those with 
$K = 1$, and for the spambase dataset further feature reduction is achieved at the same level of 
performance, reducing the number of features from 76 (when $K = 1$) to 25 (when $K = 10$ or 20) 
with an AUC of about 0.89. 

Such an effect is even more remarkable in the rounding algorithm. For instance, in the ad dataset 
the algorithm using K = 1 achieves an AUC of 0.94 with 322 features, while 13 and 7 features are 
needed using K = 10 and K = 20, respectively. 
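To make the role of $K$ concrete, here is a minimal sketch of truncated gradient with truncation applied only every $K$ steps. It assumes, based on our reading of update rule (6) and the operator $T_1$ used in the appendix, that the weights are shrunk by $\eta K g$ on every $K$-th step and left alone otherwise; the square loss, the function names, and the toy data are illustrative choices, not the Vowpal Wabbit implementation.

```python
import random

def truncate(v, alpha, theta):
    """Coordinate-wise truncation T_1(v, alpha, theta) (assumed definition)."""
    out = []
    for vj in v:
        if 0 <= vj <= theta:
            out.append(max(0.0, vj - alpha))   # shrink small positive weights toward 0
        elif -theta <= vj < 0:
            out.append(min(0.0, vj + alpha))   # shrink small negative weights toward 0
        else:
            out.append(vj)                     # leave large weights untouched
    return out

def truncated_gradient(examples, dim, eta=0.1, g=0.05, K=10,
                       theta=float("inf"), passes=1):
    """SGD on square loss; on every K-th step, truncate with gravity K * g."""
    w = [0.0] * dim
    step = 0
    for _ in range(passes):
        for x, y in examples:
            step += 1
            err = sum(wj * xj for wj, xj in zip(w, x)) - y
            w = [wj - eta * 2.0 * err * xj for wj, xj in zip(w, x)]
            if step % K == 0:
                w = truncate(w, eta * K * g, theta)
    return w

# Toy data (made up): the label depends only on feature 0; features 1, 2 are noise.
rng = random.Random(0)
data = []
for _ in range(200):
    x = [1.0, rng.uniform(-1, 1), rng.uniform(-1, 1)]
    data.append((x, 2.0 * x[0]))

w_k1 = truncated_gradient(data, dim=3, K=1)
w_k10 = truncated_gradient(data, dim=3, K=10)
```

With $K = 1$ each step removes only $\eta g$ from a weight, so a coordinate that receives frequent small gradient updates may never reach exactly zero; a larger $K$ applies the same total shrinkage in fewer, larger steps.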

Figure 2: A plot showing the ratio of the AUC when sparsification is used over the AUC when 
no sparsification is used. The same process as in Figure 1 is used to determine empirically good 
parameters. The first result is for the original dataset, while the second result is for the modified 
dataset where 1000 random features are added to each example. 

6.3 The Effects of $\theta$ in Truncated Gradient 

In this subsection, we empirically study the effect of $\theta$ in truncated gradient. The rounding algorithm 
is also included for comparison due to its similarity to truncated gradient when $\theta = g$. As before, we 
used cross validation to choose parameters for each $\theta$ value tried, and focused on the AUC metric 
in the eight UCI classification tasks, except the degenerate one of wpbc. We fixed $K = 10$ in both 
algorithms. 

Figure 5 gives the AUC vs. number-of-features plots, where each data point is generated by 
running the respective algorithm using a different value of $g$ (for truncated gradient) and $\theta$ (for the 
rounding algorithm). A few observations are in order. First, the results verify the observation that 
the behavior of truncated gradient with $\theta = g$ is similar to the rounding algorithm. Second, these 
results suggest that, in practice, it may be desirable to use $\theta = \infty$ in truncated gradient because it 
avoids the local minimum problem. 
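One way to see why the two methods can behave alike is sketched below. The coordinate-wise definitions of the rounding operator $T_0$ and the truncation operator $T_1$ are assumed here from the earlier sections (they are not restated in this section); $\alpha$ denotes the shrinkage amount passed to $T_1$.

```latex
\[
T_0(v_j, \theta) =
\begin{cases}
0 & \text{if } |v_j| \le \theta \\
v_j & \text{otherwise}
\end{cases}
\qquad
T_1(v_j, \alpha, \theta) =
\begin{cases}
\max(0, v_j - \alpha) & \text{if } v_j \in [0, \theta] \\
\min(0, v_j + \alpha) & \text{if } v_j \in [-\theta, 0] \\
v_j & \text{otherwise}
\end{cases}
\]
\[
\alpha \ge \theta \;\Longrightarrow\; T_1(v_j, \alpha, \theta) = T_0(v_j, \theta) \quad \text{for all } v_j .
\]
```

Whether $\alpha \ge \theta$ actually holds in a given run depends on $\eta$, $K$, and $g$, so this is only a partial explanation of the empirical similarity.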

6.4 Comparison to Other Algorithms 

The next set of experiments compares truncated gradient descent to other algorithms regarding their 
abilities to trade off feature sparsification and performance. Again, we focus on the AUC metric in 
UCI classification tasks except wpbc. The algorithms for comparison include: 

• The truncated gradient algorithm: We fixed $K = 10$ and $\theta = \infty$, used cross-validated 
parameters, and altered the gravity parameter $g$. 

• The rounding algorithm described in section 3.1: We fixed $K = 10$, used cross-validated 
parameters, and altered the rounding threshold $\theta$. 

• The subgradient algorithm described in section 3.2: We fixed $K = 10$, used cross-validated 
parameters, and altered the regularization parameter $g$. 

• The Lasso [13] for batch $L_1$ regularization: We used a publicly available implementation [12]. 
Note that we do not attempt to compare these algorithms on rcv1 and Big_Ads simply because 
their sizes are too large for the Lasso and subgradient descent (cf. section 5). 

Figure 6 gives the results. First, it is observed that truncated gradient is consistently competitive 
with the other two online algorithms and significantly outperforms them in some problems. This 
suggests the effectiveness of truncated gradient. 

Second, it is interesting to observe that the qualitative behavior of truncated gradient is often 
similar to that of Lasso, especially when very sparse weight vectors are allowed (the left sides of 
the graphs). This is consistent with Theorem 3.2 showing the relation between these two algorithms. 
However, Lasso usually has worse performance when the allowed number of nonzero weights is set 
too large (the right sides of the graphs). In this case, Lasso seems to overfit. In contrast, truncated 
gradient is more robust to overfitting. The robustness of online learning is often attributed to early 
stopping, which has been extensively discussed in the literature (e.g., in [14]). 

Finally, it is worth emphasizing that the experiments in this subsection try to shed some light 
on the relative strengths of these algorithms in terms of feature sparsification. For large datasets 
with sparse features such as Big_Ads, only truncated gradient, coefficient rounding, and the 
subgradient algorithm are applicable. As we have shown and argued, the 
rounding algorithm is quite ad hoc and may not work robustly in some problems, and the 
subgradient algorithm does not lead to sparsity in general during training. 
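The last point can be illustrated on a single coordinate. The sketch below assumes the subgradient update has the form $w \leftarrow w - \eta \nabla L - \eta g \, \mathrm{sgn}(w)$ and compares it with truncation at $\theta = \infty$; the function names and numbers are illustrative, and the loss gradient is set to zero to isolate the effect of the regularizer.

```python
def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def subgradient_step(w, grad, eta, g):
    """One L1 subgradient step on a single weight (assumed form of the update)."""
    return w - eta * grad - eta * g * sign(w)

def truncated_step(w, grad, eta, g):
    """One truncated gradient step on a single weight with theta = infinity."""
    v = w - eta * grad
    return max(0.0, v - eta * g) if v >= 0 else min(0.0, v + eta * g)

# Start from a small weight and apply five steps with zero loss gradient.
w_sub = w_trunc = 0.05
for _ in range(5):
    w_sub = subgradient_step(w_sub, 0.0, eta=0.1, g=1.0)
    w_trunc = truncated_step(w_trunc, 0.0, eta=0.1, g=1.0)
```

In this toy run the subgradient iterate overshoots zero and keeps flipping sign with magnitude 0.05, whereas the truncated iterate is clipped to exactly zero on the first step and stays there.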

7 Conclusion 

This paper covers the first sparsification technique for large-scale online learning with strong 
theoretical guarantees. The algorithm, truncated gradient, is the natural extension of Lasso-style 
regression to the online-learning setting. Theorem 3.1 proves that the technique is sound: it never 
harms performance much compared to standard stochastic gradient descent in adversarial situations. 
Furthermore, we show that the asymptotic solution of one instance of the algorithm is essentially 
equivalent to Lasso regression, thus justifying the algorithm's ability to produce sparse weight 
vectors when the number of features is intractably large. 

The theorem is verified experimentally in a number of problems. In some cases, especially for 
problems with many irrelevant features, this approach reduces the number of features by one or 
two orders of magnitude. 
A Proof of Theorem 3.1 

The following lemma is the essential step in our analysis. 

Lemma A.1 Suppose update rule (6) is applied to weight vector $w$ on example $z = (x, y)$ with gravity 
parameter $g_i = g$, resulting in a weight vector $w'$. If Assumption 3.1 holds, then for all $\bar{w} \in \mathbb{R}^d$, we 
have 

\[
\begin{aligned}
&(1 - 0.5 A \eta) L(w, z) + g \|w' \cdot I(|w'| \le \theta)\|_1 \\
&\quad \le L(\bar{w}, z) + g \|\bar{w} \cdot I(|w'| \le \theta)\|_1 + \frac{\eta}{2} B + \frac{\|\bar{w} - w\|^2 - \|\bar{w} - w'\|^2}{2\eta} .
\end{aligned}
\]


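PROOF. Consider any target vector $\bar{w} \in \mathbb{R}^d$ and let $\tilde{w} = w - \eta \nabla_1 L(w, z)$. We have $w' = T_1(\tilde{w}, g\eta, \theta)$. Let 

\[
u(\bar{w}, w') = g \|\bar{w} \cdot I(|w'| \le \theta)\|_1 - g \|w' \cdot I(|w'| \le \theta)\|_1 .
\]

Then the update equation implies the following: 

\[
\begin{aligned}
\|\bar{w} - w'\|^2
&\le \|\bar{w} - w'\|^2 + \|w' - \tilde{w}\|^2 \\
&= \|\bar{w} - \tilde{w}\|^2 - 2 (\bar{w} - w')^T (w' - \tilde{w}) \\
&\le \|\bar{w} - \tilde{w}\|^2 + 2 \eta \, u(\bar{w}, w') \\
&= \|\bar{w} - w\|^2 + \|w - \tilde{w}\|^2 + 2 (\bar{w} - w)^T (w - \tilde{w}) + 2 \eta \, u(\bar{w}, w') \\
&= \|\bar{w} - w\|^2 + \eta^2 \|\nabla_1 L(w, z)\|^2 + 2 \eta (\bar{w} - w)^T \nabla_1 L(w, z) + 2 \eta \, u(\bar{w}, w') \\
&\le \|\bar{w} - w\|^2 + \eta^2 \|\nabla_1 L(w, z)\|^2 + 2 \eta (L(\bar{w}, z) - L(w, z)) + 2 \eta \, u(\bar{w}, w') \\
&\le \|\bar{w} - w\|^2 + \eta^2 (A L(w, z) + B) + 2 \eta (L(\bar{w}, z) - L(w, z)) + 2 \eta \, u(\bar{w}, w') .
\end{aligned}
\]

Here, the first and second equalities follow from algebra, and the third from the definition of $\tilde{w}$. 
The first inequality follows because a square is always non-negative. The second inequality follows 
because $w' = T_1(\tilde{w}, g\eta, \theta)$, which implies that $(w' - \tilde{w})^T w' = -g\eta \|w' \cdot I(|w'| \le \theta)\|_1$ and 
$|w'_j - \tilde{w}_j| \le g \eta I(|w'_j| \le \theta)$. Therefore 

\[
\begin{aligned}
-(\bar{w} - w')^T (w' - \tilde{w})
&= -\bar{w}^T (w' - \tilde{w}) + w'^T (w' - \tilde{w}) \\
&\le g \eta \sum_{j=1}^{d} |\bar{w}_j| \, I(|w'_j| \le \theta) + (w' - \tilde{w})^T w' = \eta \, u(\bar{w}, w') .
\end{aligned}
\]

The third inequality follows from the definition of sub-gradient of a convex function, which implies 
that 

\[
(\bar{w} - w)^T \nabla_1 L(w, z) \le L(\bar{w}, z) - L(w, z)
\]

for all $w$ and $\bar{w}$. The fourth inequality follows from Assumption 3.1. Rearranging the above 
inequality leads to the desired bound. □ 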
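As a sanity check on the lemma (not a substitute for the proof), the bound can be tested numerically. The sketch below assumes square loss $L(w, z) = (w^T x - y)^2$, for which $\|\nabla_1 L(w, z)\|^2 = 4 \|x\|^2 L(w, z)$, so the gradient condition used in the fourth inequality holds at each example with $A = 4\|x\|^2$ and $B = 0$. The definition of $T_1$ and all helper names are assumptions made for illustration.

```python
import random

def truncate(v, alpha, theta):
    """Coordinate-wise truncation T_1(v, alpha, theta) (assumed definition)."""
    out = []
    for vj in v:
        if 0 <= vj <= theta:
            out.append(max(0.0, vj - alpha))
        elif -theta <= vj < 0:
            out.append(min(0.0, vj + alpha))
        else:
            out.append(vj)
    return out

def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))

def sq_dist(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

def lemma_gap(w, w_bar, x, y, eta, g, theta):
    """Right-hand side minus left-hand side of the lemma for square loss."""
    loss_w = (dot(w, x) - y) ** 2
    loss_bar = (dot(w_bar, x) - y) ** 2
    grad = [2.0 * (dot(w, x) - y) * xj for xj in x]
    A, B = 4.0 * dot(x, x), 0.0            # ||grad||^2 = 4 ||x||^2 L(w, z)
    w_tilde = [wj - eta * gj for wj, gj in zip(w, grad)]
    w_new = truncate(w_tilde, g * eta, theta)
    small = [abs(v) <= theta for v in w_new]   # indicator I(|w'| <= theta)
    l1 = lambda u: sum(abs(uj) for uj, s in zip(u, small) if s)
    lhs = (1.0 - 0.5 * A * eta) * loss_w + g * l1(w_new)
    rhs = (loss_bar + g * l1(w_bar) + 0.5 * eta * B
           + (sq_dist(w_bar, w) - sq_dist(w_bar, w_new)) / (2.0 * eta))
    return rhs - lhs

rng = random.Random(0)
gaps = []
for _ in range(1000):
    vec = lambda: [rng.uniform(-1, 1) for _ in range(5)]
    gaps.append(lemma_gap(vec(), vec(), vec(), rng.uniform(-1, 1),
                          eta=rng.uniform(0.01, 0.5), g=rng.uniform(0.0, 1.0),
                          theta=rng.choice([0.5, float("inf")])))

# The lemma predicts a non-negative gap, up to floating-point rounding.
min_gap = min(gaps)
```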
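PROOF (of Theorem 3.1). Applying Lemma A.1 to the update on trial $i$, we have 

\[
\begin{aligned}
&(1 - 0.5 A \eta) L(w_i, z_i) + g_i \|w_{i+1} \cdot I(|w_{i+1}| \le \theta)\|_1 \\
&\quad \le L(\bar{w}, z_i) + \frac{\|\bar{w} - w_i\|^2 - \|\bar{w} - w_{i+1}\|^2}{2\eta} + g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 + \frac{\eta}{2} B .
\end{aligned}
\]

Now summing over $i = 1, 2, \ldots, T$, we obtain 

\[
\begin{aligned}
&\sum_{i=1}^{T} \left[ (1 - 0.5 A \eta) L(w_i, z_i) + g_i \|w_{i+1} \cdot I(|w_{i+1}| \le \theta)\|_1 \right] \\
&\quad \le \sum_{i=1}^{T} \left[ \frac{\|\bar{w} - w_i\|^2 - \|\bar{w} - w_{i+1}\|^2}{2\eta} + L(\bar{w}, z_i) + g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 + \frac{\eta}{2} B \right] \\
&\quad = \frac{\|\bar{w} - w_1\|^2 - \|\bar{w} - w_{T+1}\|^2}{2\eta} + \frac{\eta}{2} T B + \sum_{i=1}^{T} \left[ L(\bar{w}, z_i) + g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 \right] \\
&\quad \le \frac{\|\bar{w}\|^2}{2\eta} + \frac{\eta}{2} T B + \sum_{i=1}^{T} \left[ L(\bar{w}, z_i) + g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 \right] .
\end{aligned}
\]

The first equality follows from the telescoping sum and the second inequality follows from the initial 
condition (all weights are zero) and dropping negative quantities. The theorem follows by dividing 
both sides by $T$ and rearranging terms. □ 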
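For completeness, the final step can be written out as follows. This is a sketch derived directly from the last inequality above; it should agree with the statement of Theorem 3.1 up to rearrangement, and the second form additionally assumes $0.5 A \eta < 1$ so that the factor $(1 - 0.5 A \eta)$ is positive.

```latex
\[
\frac{1}{T} \sum_{i=1}^{T} \left[ (1 - 0.5 A \eta) L(w_i, z_i) + g_i \|w_{i+1} \cdot I(|w_{i+1}| \le \theta)\|_1 \right]
\le \frac{\eta}{2} B + \frac{\|\bar{w}\|^2}{2 \eta T}
+ \frac{1}{T} \sum_{i=1}^{T} \left[ L(\bar{w}, z_i) + g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 \right]
\]
\[
\frac{1 - 0.5 A \eta}{T} \sum_{i=1}^{T} \left[ L(w_i, z_i) + \frac{g_i}{1 - 0.5 A \eta} \|w_{i+1} \cdot I(|w_{i+1}| \le \theta)\|_1 \right]
\le \frac{\eta}{2} B + \frac{\|\bar{w}\|^2}{2 \eta T}
+ \frac{1}{T} \sum_{i=1}^{T} \left[ L(\bar{w}, z_i) + g_i \|\bar{w} \cdot I(|w_{i+1}| \le \theta)\|_1 \right]
\]
```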
Figure 3: Effect of $K$ on AUC in truncated gradient. 

Figure 4: Effect of $K$ on AUC in the rounding algorithm. 

Figure 5: Effect of $\theta$ on AUC in truncated gradient. 

Figure 6: Comparison of four algorithms. 



